Version: 1.7.x

Use Job to Run Agent

This is the reference for Job, one of ROCK's two parallel ways to use agents. Its core API is rock.sdk.job.Job with JobConfig, used to run an agent evaluation/task in a sandbox. Two backends are supported: Bash Job and Harbor Bench Job.

The other way is to install and run an agent inside a single sandbox — see Install Agent in Sandbox. The two ways use distinct config schemas — do not mix them.

rock.sdk.job exposes a single Job API that supports two modes, distinguished by the config type:

  • Bash Job: Runs an arbitrary shell script inside a sandbox — useful for data processing, external evaluation tools, etc.
  • Harbor Bench Job: Runs an AI agent benchmark task via the Harbor framework (SWE-bench, Terminal Bench, etc.).

End-to-End Example

A minimal runnable Python snippet:

import asyncio
from rock.sdk.job import Job, JobConfig

async def main():
    # Load the job definition; the YAML contains agents: and datasets:
    config = JobConfig.from_yaml("swe_job_config.yaml")
    result = await Job(config).run()

    # Overall status and average score, then one line per trial
    print(f"status={result.status}, score={result.score}")
    for trial in result.trial_results:
        print(f"  {trial.task_name}: score={trial.score} ({trial.status})")

asyncio.run(main())

The full yaml template is in examples/job/harbor/swe_job_config.yaml.template.


Bash Job

Bash Job is for running arbitrary shell scripts inside a sandbox — running an external evaluation tool, processing data, etc.

Full example: examples/job/bash/claw_eval/

  • run_claw_eval.py — Entry point demonstrating JobConfig.from_yaml() + Job(config).run()
  • claw_eval_bashjob.yaml.template — YAML template with script_path, environment, uploads, env, etc.
  • run_claw_eval.sh — The script that actually runs in the sandbox (DinD startup, log writing, score output)

BashJobConfig Fields

Field          Type               Default            Description
script         str | None         None               Inline script content (mutually exclusive with script_path)
script_path    str | None         None               Local script path; the file is read and uploaded at runtime
job_name       str                current timestamp  Name used for log and artifact paths
environment    EnvironmentConfig                     Sandbox connection and resource config (see below)
namespace      str | None         None               Namespace
experiment_id  str | None         None               Experiment ID
timeout        int                7200               Overall timeout in seconds (2 hours)
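
The mutual exclusion between script and script_path can be illustrated with a small stand-in class. This is only a sketch: BashJobFieldsSketch is a hypothetical name that mirrors a few rows of the table and is not the SDK's BashJobConfig, and the timestamp format used for job_name is an assumption.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass
class BashJobFieldsSketch:
    """Hypothetical stand-in mirroring a few BashJobConfig fields; not the SDK class."""

    script: Optional[str] = None
    script_path: Optional[str] = None
    # Assumption: the exact timestamp format of the real default is not documented here.
    job_name: str = field(default_factory=lambda: datetime.now().strftime("%Y%m%d_%H%M%S"))
    timeout: int = 7200  # seconds (2 hours), as listed in the table

    def __post_init__(self) -> None:
        # The table marks script and script_path as mutually exclusive.
        if self.script is not None and self.script_path is not None:
            raise ValueError("set either script or script_path, not both")


inline = BashJobFieldsSketch(script="echo hello")
from_file = BashJobFieldsSketch(script_path="run_claw_eval.sh")
```

Passing both script and script_path to the sketch raises ValueError; how the real config class reports the conflict may differ.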

Common environment fields:

Field              Type            Description
image              str             Sandbox Docker image
base_url           str             ROCK platform URL
xrl_authorization  str             Auth token
cluster            str             Target cluster
memory             str             Memory size (e.g. "64g")
cpus               int             Number of CPUs
auto_stop          bool            Whether to stop the sandbox after the job
uploads            list            Local-to-sandbox file/dir uploads, format: [local_path, sandbox_path]
env                dict[str, str]  Environment variables injected into the sandbox session
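
Before submitting a job, it can help to sanity-check the shape of these fields locally. The helper below is a hypothetical pre-flight check, not part of the ROCK SDK; it only encodes the types listed in the table above, and the example values (image name, paths, variable names) are placeholders.

```python
from typing import Any


def check_environment_fields(env_cfg: dict[str, Any]) -> list[str]:
    """Return a list of problems found in an environment mapping (hypothetical helper)."""
    problems: list[str] = []

    # Each upload entry should be a [local_path, sandbox_path] pair of strings.
    for i, item in enumerate(env_cfg.get("uploads", [])):
        is_pair = isinstance(item, (list, tuple)) and len(item) == 2
        if not (is_pair and all(isinstance(p, str) for p in item)):
            problems.append(f"uploads[{i}] must be [local_path, sandbox_path]")

    # env is documented as dict[str, str].
    for key, value in env_cfg.get("env", {}).items():
        if not isinstance(key, str) or not isinstance(value, str):
            problems.append(f"env[{key!r}] must map str to str")

    # cpus is documented as an int (bool is rejected even though it subclasses int).
    cpus = env_cfg.get("cpus")
    if cpus is not None and (isinstance(cpus, bool) or not isinstance(cpus, int)):
        problems.append("cpus must be an int")

    return problems


example = {
    "image": "python:3.11",  # placeholder image name
    "memory": "64g",
    "cpus": 8,
    "auto_stop": True,
    "uploads": [["./data", "/workspace/data"]],
    "env": {"LOG_LEVEL": "debug"},
}
print(check_environment_fields(example))
```

An empty list means no problems were found by this sketch; the platform may apply further validation of its own.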

Harbor Bench Job

Harbor Bench Job runs AI agent benchmark tasks like SWE-bench and Terminal Bench via the Harbor framework.

Note: rock.sdk.bench.Job is deprecated and will be removed in a future release. Use rock.sdk.job.Job + HarborJobConfig instead.

Full example: examples/job/harbor/

  • harbor_demo.py — Entry point demonstrating JobConfig.from_yaml() + Job(config).run() + result iteration
  • swe_job_config.yaml.template — SWE-bench task config template
  • swe_job_config-verifier.yaml.template — Variant with verifier.mode: native
  • tb_job_config.yaml.template — Terminal Bench task config template

HarborJobConfig Core Fields

Basic fields:

Field          Type                   Default         Description
experiment_id  str                    required        Experiment ID; required by Harbor
job_name       str | None             auto-generated  Format: {dataset}_{task}_{uuid[:8]}
namespace      str | None             None            Namespace, auto-filled from the sandbox
environment    RockEnvironmentConfig                  Sandbox connection and resource config
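
To make the auto-generated job_name format concrete, here is a minimal sketch of a name shaped like {dataset}_{task}_{uuid[:8]}. The function default_job_name is hypothetical, and the choice of a random UUID4 as the source of the suffix is an assumption; the SDK's actual generation logic is not shown in this reference.

```python
import uuid


def default_job_name(dataset: str, task: str) -> str:
    """Build a name shaped like {dataset}_{task}_{uuid[:8]} (illustrative only)."""
    # Assumption: the suffix is the first 8 characters of a random UUID.
    suffix = str(uuid.uuid4())[:8]
    return f"{dataset}_{task}_{suffix}"


name = default_job_name("swebench", "demo-task")  # placeholder dataset and task names
print(name)
```

Because the suffix is random, two calls with the same dataset and task yield different names, which keeps log and artifact paths from colliding.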

Execution control:

Field       Type  Default  Description
n_attempts  int   1        Attempts per Trial
timeout     int   7200     Overall timeout (auto-derived from agent_timeout)
debug       bool  False    Debug mode; keeps more intermediate artifacts

Components:

Field         Type                 Description
agents        list[AgentConfig]    Harbor's own agent config (typical fields: name, model_name); see examples/job/harbor/swe_job_config.yaml.template for the canonical shape
datasets      list[DatasetConfig]  Dataset configs
verifier      VerifierConfig       Verifier evaluation config
orchestrator  OrchestratorConfig   Concurrency / scheduling config

Result Handling

Both Job modes return a JobResult:

result = await Job(config).run()  # call from inside an async function

print(f"status={result.status}, score={result.score}")
for trial in result.trial_results:
    print(f"  {trial.task_name}: score={trial.score} ({trial.status})")
    # exception_info is only populated when the trial raised an exception
    if trial.exception_info:
        print(f"    {trial.exception_info.exception_type}: {trial.exception_info.exception_message}")

JobResult Fields

Field / Property  Type               Description
status            JobStatus          Overall task status
trial_results     list[TrialResult]  List of all Trial results
score             float (property)   Average score across all Trials
n_completed       int (property)     Number of Trials with status completed
n_failed          int (property)     Number of Trials with status failed
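
The derived properties can be read as simple aggregates over trial_results. The sketch below recomputes them on stand-in objects; TrialSketch and summarize are hypothetical names, and returning 0.0 for an empty trial list is an assumption rather than documented SDK behavior.

```python
from dataclasses import dataclass


@dataclass
class TrialSketch:
    """Hypothetical stand-in carrying only the fields needed for aggregation."""

    task_name: str
    status: str  # "completed" or "failed"
    score: float


def summarize(trials: list[TrialSketch]) -> dict:
    """Recompute score, n_completed and n_failed as described in the table."""
    n_completed = sum(t.status == "completed" for t in trials)
    n_failed = sum(t.status == "failed" for t in trials)
    # Average across all trials; the empty-list fallback of 0.0 is an assumption.
    score = sum(t.score for t in trials) / len(trials) if trials else 0.0
    return {"score": score, "n_completed": n_completed, "n_failed": n_failed}


trials = [
    TrialSketch("task-a", "completed", 1.0),
    TrialSketch("task-b", "failed", 0.0),
    TrialSketch("task-c", "completed", 0.5),
]
summary = summarize(trials)
print(summary)
```

Note that the average is taken over all trials, so a failed trial with score 0.0 lowers the overall score.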

TrialResult Fields

Field / Property  Type                  Description
task_name         str                   Task name
exit_code         int                   Process exit code
raw_output        str                   Raw process output
exception_info    ExceptionInfo | None  Populated if an exception occurred
status            str (property)        "completed" or "failed"
duration_sec      float (property)      Execution time in seconds
score             float (property)      Score (Bash Job defaults to 0.0; Harbor mode comes from the verifier)
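
When many trials fail, grouping them by cause is often the first debugging step. The helper below is a sketch that relies only on the attributes listed above (status, exit_code, exception_info and its exception_type); failure_breakdown is a hypothetical function, and the SimpleNamespace objects are stand-ins for real TrialResult instances.

```python
from collections import Counter
from types import SimpleNamespace


def failure_breakdown(trial_results) -> Counter:
    """Count failed trials by exception type, falling back to the exit code."""
    counts: Counter = Counter()
    for trial in trial_results:
        if trial.status != "failed":
            continue
        info = trial.exception_info
        # Use the exception type when present, otherwise label by exit code.
        key = info.exception_type if info else f"exit_code={trial.exit_code}"
        counts[key] += 1
    return counts


# Stand-in objects for illustration; real code would pass result.trial_results.
fake_trials = [
    SimpleNamespace(status="completed", exit_code=0, exception_info=None),
    SimpleNamespace(status="failed", exit_code=1, exception_info=None),
    SimpleNamespace(
        status="failed",
        exit_code=1,
        exception_info=SimpleNamespace(exception_type="TimeoutError", exception_message="timed out"),
    ),
]
breakdown = failure_breakdown(fake_trials)
print(breakdown.most_common())
```

The function is duck-typed, so the same call should work on a real result as long as the attribute names match the table.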